10. HTML Files in Python

HTML Files in Python 1

Quiz

Using your knowledge of HTML file structure, you will use Beautiful Soup to extract the Audience Score metric and the number of audience ratings from each HTML file, along with the movie title as in the video above (so we have something to merge the datasets on later), and then save them in a pandas DataFrame.

The Jupyter Notebook below contains template code that:

  • Creates an empty list, df_list, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (this is the most efficient way of building a DataFrame row by row).
  • Loops through each movie's Rotten Tomatoes HTML file in the rt_html folder.
  • Opens each HTML file with a file handle called file.
  • Creates a DataFrame called df by converting df_list using the pd.DataFrame constructor.
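
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual template: the function name build_dataframe, the placeholder fields file_name and n_characters, and the temporary folder standing in for rt_html are all assumptions made so the example can run on its own.

```python
import os
import tempfile

import pandas as pd


def build_dataframe(folder):
    """Collect one dictionary per HTML file, then convert the list once."""
    df_list = []  # list of dictionaries, one per movie file
    for movie_html in sorted(os.listdir(folder)):
        path = os.path.join(folder, movie_html)
        with open(path, encoding="utf-8") as file:
            contents = file.read()
        # Placeholder fields (hypothetical): the real task extracts the title,
        # audience score, and number of audience ratings here instead.
        df_list.append({"file_name": movie_html, "n_characters": len(contents)})
    # A single conversion at the end, rather than growing a DataFrame per row.
    return pd.DataFrame(df_list, columns=["file_name", "n_characters"])


# Demo on a throwaway folder standing in for rt_html (made-up files).
with tempfile.TemporaryDirectory() as folder:
    samples = [("a.html", "<html></html>"),
               ("b.html", "<html><body></body></html>")]
    for name, body in samples:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write(body)
    df = build_dataframe(folder)

print(df)
```

Appending plain dictionaries to a list and calling the constructor once avoids copying the DataFrame on every iteration, which is why the template is organized this way.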

Your task is to extract the title, audience score, and number of audience ratings in each HTML file so each trio can be appended as a dictionary to df_list.

The Beautiful Soup methods required for this task are:

  • find()
  • find_all()

There is an excellent tutorial on these methods (Searching the tree) in the Beautiful Soup documentation. Please consult that tutorial if you are stuck.
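
As a small illustration of how find() and find_all() behave, here is a sketch run against a made-up HTML snippet. The markup, the class names audience-score and audience-info, and the numbers are all hypothetical and will not match the real Rotten Tomatoes files exactly, so treat this as a pattern to adapt rather than the solution.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page: the real files are larger and use
# different class names.
html = """
<html>
<head><title>Example Movie (2000) - Rotten Tomatoes</title></head>
<body>
<div class="audience-score"><span>85%</span></div>
<div class="audience-info">
  <div>Average Rating: 4.1/5</div>
  <div>User Ratings: 1,234,567</div>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches).
title_suffix = " - Rotten Tomatoes"
title = soup.find("title").get_text()
if title.endswith(title_suffix):
    title = title[: -len(title_suffix)]

# Searches can be chained: find the div first, then the span inside it.
score_div = soup.find("div", class_="audience-score")
audience_score = int(score_div.find("span").get_text().strip().rstrip("%"))

# find_all() returns a list of every matching tag, so it can be indexed.
info_div = soup.find("div", class_="audience-info")
ratings_text = info_div.find_all("div")[1].get_text()
number_of_audience_ratings = int(
    ratings_text.split(":")[1].strip().replace(",", "")
)

row = {
    "title": title,
    "audience_score": audience_score,
    "number_of_audience_ratings": number_of_audience_ratings,
}
print(row)
```

A dictionary shaped like row is what would be appended to df_list for each file. Converting the score and rating count to integers at this point is a design choice that saves a cleaning step later.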

Workspace

Solution

HTML Files In Python 2

Note: At 3:59 in the video, "empty character" is said where "empty string" was intended.